Introduction: This report explores a credit card fraud detection dataset from Kaggle.com. The dataset contains transactions made with credit cards in September 2013 by European cardholders. It covers transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions. It contains only numerical input variables, which are the result of a PCA transformation. Features V1, V2, … V28 are the principal components obtained with PCA; the only features that have not been transformed with PCA are ‘Time’ and ‘Amount’. The feature ‘Time’ contains the seconds elapsed between each transaction and the first transaction in the dataset. The feature ‘Amount’ is the transaction amount. The feature ‘Class’ is the response variable; it takes the value 1 in case of fraud and 0 otherwise.
Summary of the dataset:
## Time V1 V2
## Min. : 0 Min. :-56.40751 Min. :-72.71573
## 1st Qu.: 54202 1st Qu.: -0.92037 1st Qu.: -0.59855
## Median : 84692 Median : 0.01811 Median : 0.06549
## Mean : 94814 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.:139321 3rd Qu.: 1.31564 3rd Qu.: 0.80372
## Max. :172792 Max. : 2.45493 Max. : 22.05773
## V3 V4 V5
## Min. :-48.3256 Min. :-5.68317 Min. :-113.74331
## 1st Qu.: -0.8904 1st Qu.:-0.84864 1st Qu.: -0.69160
## Median : 0.1799 Median :-0.01985 Median : -0.05434
## Mean : 0.0000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 1.0272 3rd Qu.: 0.74334 3rd Qu.: 0.61193
## Max. : 9.3826 Max. :16.87534 Max. : 34.80167
## V6 V7 V8
## Min. :-26.1605 Min. :-43.5572 Min. :-73.21672
## 1st Qu.: -0.7683 1st Qu.: -0.5541 1st Qu.: -0.20863
## Median : -0.2742 Median : 0.0401 Median : 0.02236
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.3986 3rd Qu.: 0.5704 3rd Qu.: 0.32735
## Max. : 73.3016 Max. :120.5895 Max. : 20.00721
## V9 V10 V11
## Min. :-13.43407 Min. :-24.58826 Min. :-4.79747
## 1st Qu.: -0.64310 1st Qu.: -0.53543 1st Qu.:-0.76249
## Median : -0.05143 Median : -0.09292 Median :-0.03276
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.59714 3rd Qu.: 0.45392 3rd Qu.: 0.73959
## Max. : 15.59500 Max. : 23.74514 Max. :12.01891
## V12 V13 V14
## Min. :-18.6837 Min. :-5.79188 Min. :-19.2143
## 1st Qu.: -0.4056 1st Qu.:-0.64854 1st Qu.: -0.4256
## Median : 0.1400 Median :-0.01357 Median : 0.0506
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.6182 3rd Qu.: 0.66251 3rd Qu.: 0.4931
## Max. : 7.8484 Max. : 7.12688 Max. : 10.5268
## V15 V16 V17
## Min. :-4.49894 Min. :-14.12985 Min. :-25.16280
## 1st Qu.:-0.58288 1st Qu.: -0.46804 1st Qu.: -0.48375
## Median : 0.04807 Median : 0.06641 Median : -0.06568
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.64882 3rd Qu.: 0.52330 3rd Qu.: 0.39968
## Max. : 8.87774 Max. : 17.31511 Max. : 9.25353
## V18 V19 V20
## Min. :-9.498746 Min. :-7.213527 Min. :-54.49772
## 1st Qu.:-0.498850 1st Qu.:-0.456299 1st Qu.: -0.21172
## Median :-0.003636 Median : 0.003735 Median : -0.06248
## Mean : 0.000000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.500807 3rd Qu.: 0.458949 3rd Qu.: 0.13304
## Max. : 5.041069 Max. : 5.591971 Max. : 39.42090
## V21 V22 V23
## Min. :-34.83038 Min. :-10.933144 Min. :-44.80774
## 1st Qu.: -0.22839 1st Qu.: -0.542350 1st Qu.: -0.16185
## Median : -0.02945 Median : 0.006782 Median : -0.01119
## Mean : 0.00000 Mean : 0.000000 Mean : 0.00000
## 3rd Qu.: 0.18638 3rd Qu.: 0.528554 3rd Qu.: 0.14764
## Max. : 27.20284 Max. : 10.503090 Max. : 22.52841
## V24 V25 V26
## Min. :-2.83663 Min. :-10.29540 Min. :-2.60455
## 1st Qu.:-0.35459 1st Qu.: -0.31715 1st Qu.:-0.32698
## Median : 0.04098 Median : 0.01659 Median :-0.05214
## Mean : 0.00000 Mean : 0.00000 Mean : 0.00000
## 3rd Qu.: 0.43953 3rd Qu.: 0.35072 3rd Qu.: 0.24095
## Max. : 4.58455 Max. : 7.51959 Max. : 3.51735
## V27 V28 Amount Class
## Min. :-22.565679 Min. :-15.43008 Min. : 0.00 0:284315
## 1st Qu.: -0.070840 1st Qu.: -0.05296 1st Qu.: 5.60 1: 492
## Median : 0.001342 Median : 0.01124 Median : 22.00
## Mean : 0.000000 Mean : 0.00000 Mean : 88.35
## 3rd Qu.: 0.091045 3rd Qu.: 0.07828 3rd Qu.: 77.17
## Max. : 31.612198 Max. : 33.84781 Max. :25691.16
The raw dataset consists of 284807 transaction records, of which 492 are fraudulent. There are no missing values in the dataset. I also found 1081 duplicate records. These duplicates will be removed before I create a t-SNE plot, to prevent error messages.
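The duplicate check can be sketched as follows. This is a minimal Python illustration, not the code used for this report: the function name `drop_duplicate_rows` and the toy rows are hypothetical, and each record is assumed to be a tuple of field values.

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences in order."""
    # dict.fromkeys preserves insertion order, so the first copy of each row survives.
    return list(dict.fromkeys(rows))


# Toy records (Time, Amount, Class); the values are made up for illustration.
rows = [(0, 149.62, 0), (0, 2.69, 0), (0, 149.62, 0), (1, 378.66, 0)]
unique_rows = drop_duplicate_rows(rows)
n_duplicates = len(rows) - len(unique_rows)
print(n_duplicates)
```

Counting duplicates as the difference in length matches the idea of "extra copies" of a record; whether the 1081 figure above was counted this way is not stated in the report.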
Explore the Class:
The dataset is highly imbalanced: the fraudulent records account for only 0.172% of all transactions.
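The imbalance can be checked with a line of arithmetic. The short Python sketch below uses only the two counts from the summary; the variable names are my own. It also shows why plain accuracy is a poor yardstick on this data.

```python
n_total = 284807   # transactions reported in the summary above
n_fraud = 492      # records with Class == 1

fraud_rate = n_fraud / n_total
print(f"fraud share: {fraud_rate:.4%}")

# Always predicting "not fraud" is right for every normal transaction,
# so plain accuracy looks excellent while no fraud is caught.
naive_accuracy = (n_total - n_fraud) / n_total
print(f"accuracy of always predicting 0: {naive_accuracy:.4%}")
```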
Explore the Time:
Histogram of Time per minute
Histogram of Time per hour
The density of Time
The largest value of Time is 172792 seconds, which is roughly 48 hours. There appear to be two peaks and two troughs during these two days. I assume the peaks occur in the daytime and the troughs at night. I wonder whether I can transform Time into Hour, a categorical variable that represents the hour of the day, assuming the time starts from 12:00am.
The period 9:00-22:00 is the rush period, when most of the transactions are made.
Explore the Amount:
Histogram of Amount
The distribution of Amount is highly skewed. Plotted on a log scale, it looks like a roughly normal, bimodal distribution.
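Because the minimum Amount is 0, a plain logarithm is undefined for some records. One common workaround is to add a small offset before taking the log. The sketch below assumes an offset of 1; the report does not say how zero amounts were handled, so both the offset and the function name `log_amount` are assumptions.

```python
import math


def log_amount(amount, offset=1.0):
    """log10 of a non-negative amount; the offset keeps zero amounts finite."""
    if amount < 0:
        raise ValueError("amount must be non-negative")
    return math.log10(amount + offset)


# Minimum, median, and maximum Amount from the summary table above.
for amount in (0.00, 22.00, 25691.16):
    print(amount, round(log_amount(amount), 3))
```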
Explore V1-V28
Let’s plot the histograms of V1-V28.
Boxplot of V1-V28
Can’t see the box? Let’s make another boxplot of V1-V28 with most outliers removed.
The plots show that most distributions have low skewness and zero mean. Some have high kurtosis, e.g. V28, and some are close to normal, e.g. V13. V1 is highly left-skewed, so I apply the transformation log10(-x+3), which converts the long tail into a better shape.
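The transformation can be written as a small function. This Python sketch is only an illustration of the formula log10(-x+3); the name `transform_v1` is hypothetical. Since the maximum of V1 in the summary is about 2.45, the argument of the logarithm stays positive for every record. Note that the transformation reverses the order of the values: the long left tail of V1 ends up on the right.

```python
import math


def transform_v1(x):
    """Apply log10(-x + 3), the transformation described above for V1."""
    shifted = 3.0 - x
    if shifted <= 0:
        raise ValueError("x must be below 3 for the logarithm to be defined")
    return math.log10(shifted)


# Minimum and maximum of V1 from the summary table.
v1_min, v1_max = -56.40751, 2.45493
print(transform_v1(v1_min), transform_v1(v1_max))
```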
This plot shows V1 after the transformation. There appear to be three peaks where the data cluster.
The dataset contains 284807 transaction records over two days. The transactions are ordered by Time. The fraudulent transactions account for only 0.172% of all transactions. The median and mean of the transaction amount are both less than 100; the maximum amount is 25691.16. V1-V28 are zero-mean distributions with either high skewness or high kurtosis.
The main features of interest are Class and all the independent variables that might be useful for predicting frauds.
Time may help to detect frauds. I wonder whether frauds have a different distribution from normal transactions, for example, whether more frauds occur at night.
Time counts the seconds elapsed between the current transaction and the first transaction. I need to transform Time into a more meaningful variable than a plain counter. The peaks in transaction volume seem periodic with a 24-hour cycle, so I create a categorical Hour variable that extracts the hour of the day from the elapsed time, assuming the first transaction occurs at 12:00am.
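The Hour variable amounts to integer division followed by a modulo. The Python sketch below illustrates the idea; `time_to_hour` and its `start_hour` parameter are hypothetical names, and the default of 0 encodes the assumption that the first transaction occurs at 12:00am.

```python
SECONDS_PER_HOUR = 3600
HOURS_PER_DAY = 24


def time_to_hour(seconds, start_hour=0):
    """Map seconds since the first transaction to an hour of day (0-23)."""
    elapsed_hours = int(seconds // SECONDS_PER_HOUR)
    return (start_hour + elapsed_hours) % HOURS_PER_DAY


print(time_to_hour(0))        # first transaction
print(time_to_hour(94814))    # mean Time from the summary
print(time_to_hour(172792))   # last transaction
```

If the true start time were known, only `start_hour` would need to change; the 24-hour periodicity itself does not depend on that assumption.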
I log-transformed the left-skewed V1 and the right-skewed Amount to make the data easier to visualize. The transformed V1 shows a distribution with three peaks.
Explore Time vs Class
The plots show that frauds have a different distribution over Time compared with normal transactions. The number of frauds in the daytime is a bit higher than at night, but there is no significant drop at night.
Explore Hour vs Class
Clearly, frauds can occur at any time of day. However, Hour alone does not give a rule to distinguish fraud from nonfraud transactions, because there are still thousands of normal transactions at night.
Explore Amount and transformed Amount vs Class
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 9.25 122.20 105.90 2126.00
The distributions of transaction amount for fraud and nonfraud seem similar.
Explore V1-V28 vs Class
The density distribution of V1-V28 by Class
The plots show that the distributions of frauds have a lower kurtosis (are flatter) than those of normal transactions. Some features have an apparently different median for the frauds. I think the features with less overlapping area under the density curves can be useful for detecting frauds.
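The visual impression of "flatter" densities could be checked numerically by computing the excess kurtosis of each feature separately for Class 0 and Class 1. The Python sketch below uses the population formula m4 / m2² − 3 on toy data; the function name and the sample values are illustrative and do not come from the dataset.

```python
def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3 (0 for a normal distribution)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for constant data")
    return m4 / m2 ** 2 - 3.0


# Toy samples: a flat spread versus a sharp peak with two extreme values.
flat = [-2, -1, 0, 1, 2]
peaked = [-10, 0, 0, 0, 0, 0, 0, 0, 0, 10]
print(excess_kurtosis(flat), excess_kurtosis(peaked))
```

A negative value indicates a flatter shape than the normal distribution, a positive value a sharper peak with heavier tails.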
The plot shows that most transaction amounts are less than 200, and the median amount varies around 20 during the day.
It seems frauds occur roughly uniformly throughout the day, regardless of day or night. Since the number of normal transactions drops at night, the probability that a transaction is a fraud increases slightly at night.
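The reasoning about night-time probability is simply the fraud share per hour: the same number of frauds divided by fewer transactions. A minimal Python sketch, with made-up counts and a hypothetical function name `fraud_rate_by_hour`:

```python
from collections import Counter


def fraud_rate_by_hour(records):
    """records: iterable of (hour, is_fraud) pairs; returns {hour: fraud share}."""
    totals = Counter()
    frauds = Counter()
    for hour, is_fraud in records:
        totals[hour] += 1
        if is_fraud:
            frauds[hour] += 1
    return {hour: frauds[hour] / totals[hour] for hour in sorted(totals)}


# Made-up counts: the same number of frauds at 3:00 and at 15:00,
# but far fewer normal transactions at 3:00.
records = ([(3, 1)] * 2 + [(3, 0)] * 98 +
           [(15, 1)] * 2 + [(15, 0)] * 998)
rates = fraud_rate_by_hour(records)
print(rates)
```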
The smallest fraud amount is 0 and the largest is 2126. I don’t see any specific amount with a significantly higher probability of being a fraud.
The features V1-V28 seem more informative, because most of them show different distributions for fraud and nonfraud.
The median transaction amount is higher in the daytime than at night. Daytime transactions tend to be both more numerous and larger.
The features V1-V7 V9-V12 V14 V16-V19 V21 have apparently distinct density shapes across the two classes. I think these features are very important for detecting frauds.
Amount vs Time by Class
The red points mark the occurrence of a fraud. The red points are always surrounded by white points, so we cannot identify any pattern in which frauds behave differently from normal transactions. Amount may not be useful for fraud detection.
Time series plot V1-V28 by Class
Where the red points are not surrounded by white points, or lie far away from them, I think a supervised learning model can draw a boundary that separates the frauds. Based on the plots above, I would like to select the better features V1-V5 V7-V12 V14 V16-V18, which clearly separate most of the red points from the clusters of white points.
A couple of features show a clear shift during a specific time of day, e.g. V12 and V13. I am curious about the hours when the shift occurs. I think Hour is a useful feature that should be included in the model.
Let’s explore V12 and V13.
The plots show that the shifts occur at the same time every day, from 1:00 to 7:00. Interestingly, the transactions that ‘forget’ to shift their V12 value back to normal in the daytime are probably regarded as frauds.
Pairs plot of all features by Class
## [1] "Time" "V1" "V2" "V3" "V4" "V5"
## [7] "V6" "V7" "V8" "V9" "V10" "V11"
## [13] "V12" "V13" "V14" "V15" "V16" "V17"
## [19] "V18" "V19" "V20" "V21" "V22" "V23"
## [25] "V24" "V25" "V26" "V27" "V28" "Amount"
## [31] "Class" "Hour" "Amount_A" "V1_A"
The image is very large, so I’ve saved a high-resolution version separately.
The pairs plot shows that, for normal transactions, there is no significant correlation between features. For frauds, however, some features are correlated.
Explore correlations
The plots show that the features are almost uncorrelated for normal transactions. The frauds, however, have strong correlations among the features V1-V5 V7 V9-V12 V14 V16-V19.
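Comparing correlations by class means computing the Pearson coefficient on each subset of rows separately. Near-zero correlations among normal transactions are what one would expect if V1-V28 are principal components of this same data, since principal components are uncorrelated over the data they were fitted on and normal transactions make up almost all of it. The Python sketch below illustrates the per-class computation on toy data; all names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_by_class(rows):
    """rows: (a, b, label) triples; returns {label: Pearson r of a and b}."""
    result = {}
    for label in sorted({row[2] for row in rows}):
        subset = [row for row in rows if row[2] == label]
        result[label] = pearson([r[0] for r in subset], [r[1] for r in subset])
    return result


# Toy data: the two features move together only for label 1.
rows = [(-1, 1, 0), (1, 1, 0), (-1, -1, 0), (1, -1, 0),
        (-4, -8, 1), (-3, -6, 1), (-2, -4, 1), (-1, -2, 1)]
by_class = correlation_by_class(rows)
print(by_class)
```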
t-SNE plot
## Read the 10473 x 39 data matrix successfully!
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Normalizing input...
## Building tree...
## - point 0 of 10473
## - point 10000 of 10473
## Done in 8.19 seconds (sparsity = 0.012363)!
## Learning embedding...
## Iteration 50: error is 97.811603 (50 iterations in 4.80 seconds)
## Iteration 100: error is 88.920287 (50 iterations in 5.07 seconds)
## Iteration 150: error is 84.392067 (50 iterations in 4.91 seconds)
## Iteration 200: error is 83.664107 (50 iterations in 4.86 seconds)
## Iteration 250: error is 83.335895 (50 iterations in 4.86 seconds)
## Iteration 300: error is 3.075141 (50 iterations in 4.59 seconds)
## Iteration 350: error is 2.651152 (50 iterations in 4.62 seconds)
## Iteration 400: error is 2.411062 (50 iterations in 4.46 seconds)
## Iteration 450: error is 2.251608 (50 iterations in 4.53 seconds)
## Iteration 500: error is 2.136952 (50 iterations in 4.54 seconds)
## Iteration 550: error is 2.049831 (50 iterations in 4.58 seconds)
## Iteration 600: error is 1.981499 (50 iterations in 4.61 seconds)
## Iteration 650: error is 1.926512 (50 iterations in 4.56 seconds)
## Iteration 700: error is 1.882047 (50 iterations in 4.61 seconds)
## Iteration 750: error is 1.846762 (50 iterations in 4.61 seconds)
## Iteration 800: error is 1.818805 (50 iterations in 4.62 seconds)
## Iteration 850: error is 1.797712 (50 iterations in 4.74 seconds)
## Iteration 900: error is 1.781142 (50 iterations in 4.75 seconds)
## Iteration 950: error is 1.768946 (50 iterations in 4.78 seconds)
## Iteration 1000: error is 1.759527 (50 iterations in 5.54 seconds)
## Fitting performed in 94.64 seconds.
I chose the features V1-V5 V7 V9-V12 V14 V16-V18 and Hour to run the t-SNE algorithm, since these features show stronger fraud patterns. The t-SNE plot contains all fraud points and 10000 sampled nonfraud points. It shows two major clusters of frauds (upper left and lower left), as well as individual frauds whose features look very similar to those of normal transactions and are therefore hard to identify.
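The sampling step for the t-SNE input can be sketched as follows. This is an illustrative Python version, not the code that produced the log above: the function name, the seed, and the toy rows are assumptions, and rows are assumed to be tuples with the class label in the last position. The 10473 rows in the log are consistent with 10000 sampled nonfraud rows plus 473 fraud rows.

```python
import random


def build_tsne_sample(rows, label_index=-1, n_normal=10000, seed=0):
    """Keep every fraud row and a random sample of normal rows, without duplicates."""
    unique_rows = list(dict.fromkeys(rows))          # drop exact duplicates first
    frauds = [r for r in unique_rows if r[label_index] == 1]
    normals = [r for r in unique_rows if r[label_index] == 0]
    rng = random.Random(seed)                        # fixed seed for repeatability
    sampled = rng.sample(normals, min(n_normal, len(normals)))
    return frauds + sampled


# Toy rows (feature, Class): 5 frauds, 50 normal rows, one duplicated fraud.
rows = [(i, 1) for i in range(5)] + [(0, 1)] + [(i, 0) for i in range(50)]
sample = build_tsne_sample(rows, n_normal=10)
print(len(sample))
```

Keeping all frauds while subsampling the normal class makes the rare class visible in the plot, at the cost of misrepresenting the true class proportions.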
The time series plots of the features are more helpful for seeing how the transaction distribution varies during the day. I will select the most useful features, V1-V5 V7 V9-V12 V14 V16-V18, for fraud detection.
From the correlation heat matrix, I see that some features are highly correlated, e.g. V16-V18. To avoid redundancy, I think some of the correlated features can be dropped from the model.
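One simple way to drop redundant features is a greedy filter: walk through the features in order and keep one only if its absolute correlation with every feature already kept is below a threshold. The Python sketch below shows the idea; the threshold of 0.8 and the correlation values are hypothetical, not measured on this dataset, and this is not necessarily the selection procedure used for the model.

```python
def drop_correlated(features, corr, threshold=0.8):
    """Greedily keep features whose |r| with every kept feature is below threshold.

    corr maps frozenset({a, b}) to the correlation of features a and b.
    """
    kept = []
    for name in features:
        if all(abs(corr[frozenset((name, other))]) < threshold for other in kept):
            kept.append(name)
    return kept


# Hypothetical correlations among three features; not values from this dataset.
corr = {
    frozenset(("V16", "V17")): 0.95,
    frozenset(("V16", "V18")): 0.90,
    frozenset(("V17", "V18")): 0.93,
}
selected = drop_correlated(["V16", "V17", "V18"], corr)
print(selected)
```

The result depends on the order of the input list, so the features judged most useful should come first.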
Features like V12 and V13 have a periodic shift at 1:00-7:00 every day, and their distributions also change while the shift occurs.
I also created a Python script that builds a baseline neural network model for fraud detection.
The model scores around 0.8 AUPRC and is able to detect about 80% of frauds without disturbing many customers. However, pushing the detection rate above 80% is very difficult, because a huge number of customers would have to be inspected to discover only a few more frauds.
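The trade-off described above can be illustrated with average precision, a common step-wise estimate of AUPRC, together with a count of how many top-scored transactions must be inspected to reach a given detection rate. The Python sketch below uses made-up labels and scores; it is not the evaluation code of the model, and the function names are hypothetical.

```python
def average_precision(labels, scores):
    """Average precision: mean of the precision at the rank of each true fraud.

    Tied scores are ranked in input order here; a careful implementation
    would treat each group of ties as a single threshold.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n_positive = sum(labels)
    if n_positive == 0:
        raise ValueError("at least one positive label is required")
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank   # precision when this fraud is flagged
    return precision_sum / n_positive


def flagged_for_recall(labels, scores, target_recall):
    """Number of top-scored transactions to inspect to reach the target recall."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    needed = target_recall * sum(labels)
    hits = 0
    for rank, (_, label) in enumerate(ranked, start=1):
        hits += label
        if hits >= needed:
            return rank
    return len(ranked)


# Made-up scores: four frauds score high, the fifth hides among normal transactions.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
scores = [0.99, 0.95, 0.90, 0.85, 0.80, 0.40, 0.30, 0.20, 0.10, 0.05]
print(average_precision(labels, scores))
print(flagged_for_recall(labels, scores, 0.8), flagged_for_recall(labels, scores, 1.0))
```

In this toy example, catching the last fraud requires inspecting every transaction, which mirrors the difficulty of moving beyond 80% detection.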
Plot one shows the transaction Amount over the two days; the red points are fraudulent transactions.
Plot two indicates a distribution shift on V12 from 1:00 to 7:00.
The t-SNE plot reduces the high-dimensional feature space to two dimensions. It shows two clusters of red points, which are fraudulent transactions.
The credit card dataset contains two days of transactions, of which only 0.172% are frauds. I started by exploring individual features and the relationships among multiple features, and eventually selected the best features for a model. I also built a baseline model that is able to detect 80% of frauds without disturbing many customers.
I struggled to select the features that distinguish frauds as well as possible. Some features are strongly correlated, but I have no background information beyond Time and Amount to explain the correlations. I am still looking for high-dimensional visualization tools to better reveal any hidden fraud patterns across all features.
Because frauds are very rare, I use AUPRC as the metric to evaluate a model. My model achieves an average score of 0.8 and detects 80% of frauds. I think it is very difficult to break through this score. The remaining 20% of frauds are unfortunately well camouflaged: their values of V1-V28 are all close to zero, the mean of normal transactions. Hence, I conclude that the existing features are not sufficient to uncover all frauds. Collecting more features and more transaction records from different days is recommended to build a better classification model.
Future work will investigate the fraudulent cases that the model fails to detect.